Systems and methods for a full text search engine

ABSTRACT

Aspects of the current patent document include systems and methods for full text search engines. In embodiments, a full text search engine is implemented in object storage. In embodiments, a distributed database index is used in conjunction with the object storage. In embodiments, the distributed database is encrypted and moved to object storage. In embodiments, object storage stores a plurality of blocks containing words. In embodiments, each block can contain one million words.

CROSS-REFERENCE TO RELATED APPLICATION

This patent application claims the priority benefit of co-pending and commonly-owned Indian Provisional Application 201711012654, filed on Apr. 7, 2017, entitled “SYSTEMS AND METHODS FOR A FULL TEXT SEARCH ENGINE,” and listing Milind Borate, Yogendra Acharya, and Anand Apte as inventors (Docket No. 20133-2083IN), which patent document is incorporated by reference herein in its entirety and for all purposes.

BACKGROUND

A. Technical Field

The present invention relates generally to data storage and searching, and relates more particularly to a full text search engine.

B. Description of the Related Art

As the value and use of information continues to increase, individuals and businesses seek additional ways to process, store, and search information. One option available to users is information handling systems. An information handling system generally processes, compiles, stores, searches and/or communicates information or data for business, personal, or other purposes thereby allowing users to take advantage of the value of the information. Because technology and information handling needs and requirements vary between different users or applications, information handling systems may also vary regarding what information is handled, how the information is handled, how much information is processed, stored, or communicated, and how quickly and efficiently the information may be processed, stored, or communicated. The variations in information handling systems allow for information handling systems to be general or configured for a specific user or specific use, such as financial transaction processing, airline reservations, enterprise data storage, or global communications. In addition, information handling systems may include a variety of hardware and software components that may be configured to process, store, and communicate information and may include one or more computer systems, data storage systems, and networking systems.

Information handling systems also need a mechanism to index and search the information stored. Storing information in a way that it can be indexed and searched easily and quickly is expensive. Prior art indexing systems use hard disk drives to store information and create an index that can be searched responsive to a search query. FIG. 1 depicts a high-level block diagram of an indexing system. FIG. 1 shows indexing system 100 including a list of words with their corresponding document identification (doc id) 105 and index 110. As new words are added to the index 115, the words can be searched 120. A document identifier 125 can be output responsive to a search query.

One shortcoming of the prior art indexing scheme is that it is difficult to scale. As the index grows, more hard disk drive space is needed. Hard disk drives typically have to be connected to a machine or a computer. The hard disk drive and the machine must remain on in order for the index to be updated or searched. It can be expensive to run the hard disk drives and the machines connected to them at all times. Further, as the index grows, more disk drive space is needed and typically more than one disk drive and more than one machine are used.

Accordingly, what is needed are systems and methods that improve storage and indexing of full text search engines and provide additional scalability.

BRIEF DESCRIPTION OF THE DRAWINGS

References will be made to embodiments of the invention, examples of which may be illustrated in the accompanying figures. These figures are intended to be illustrative, not limiting. Although the invention is generally described in the context of these embodiments, it should be understood that it is not intended to limit the scope of the invention to these particular embodiments.

FIG. 1 depicts a high-level block diagram of an indexing system according to embodiments in this patent document.

FIG. 2 depicts a block diagram of a merge operation in an indexing system according to embodiments of the present disclosure.

FIG. 3 depicts a block diagram of a B+ tree structure indexing system according to embodiments of the present disclosure.

FIG. 4 depicts a block diagram of an indexing system according to embodiments of the present invention.

FIG. 5 depicts a block diagram of an indexing system using object storage and a distributed database according to embodiments of the present invention.

FIG. 6A depicts a block diagram of an indexing system using object storage and a distributed database using encryption according to embodiments of the present invention.

FIG. 6B depicts a block diagram of an indexing system using object storage and a single distributed database entry according to embodiments of the present invention.

FIG. 7 depicts a flowchart depicting a process of creating an index according to embodiments of the present invention.

FIG. 8 depicts a flowchart depicting a process of searching an index according to embodiments of the present invention.

FIG. 9 depicts a block diagram of a computer system according to embodiments of the present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

In the following description, for purposes of explanation, specific details are set forth in order to provide an understanding of the invention. It will be apparent, however, to one skilled in the art that the invention can be practiced without these details. Furthermore, one skilled in the art will recognize that embodiments of the present invention, described below, may be implemented in a variety of ways, such as a process, an apparatus, a system/device, or a method on a tangible computer-readable medium.

Components, or modules, shown in diagrams are illustrative of exemplary embodiments of the invention and are meant to avoid obscuring the invention. It shall also be understood that throughout this discussion components may be described as separate functional units, which may comprise sub-units, but those skilled in the art will recognize that various components, or portions thereof, may be divided into separate components or may be integrated together, including integrated within a single system or component. It should be noted that functions or operations discussed herein may be implemented as components. Components may be implemented in software, hardware, or a combination thereof.

Furthermore, connections between components or systems within the figures are not intended to be limited to direct connections. Rather, data between these components may be modified, re-formatted, or otherwise changed by intermediary components. Also, additional or fewer connections may be used. It shall also be noted that the terms “link,” “linked,” “coupled,” “connected,” “communicatively coupled,” or their variants shall be understood to include direct connections, indirect connections through one or more intermediary devices, and wireless connections.

Reference in the specification to “one embodiment,” “preferred embodiment,” “an embodiment,” or “embodiments” means that a particular feature, structure, characteristic, or function described in connection with the embodiment is included in at least one embodiment of the invention and may be in more than one embodiment. Also, the appearances of the above-noted phrases in various places in the specification are not necessarily all referring to the same embodiment or embodiments.

The use of certain terms in various places in the specification is for illustration and should not be construed as limiting. A service, function, or resource is not limited to a single service, function, or resource; usage of these terms may refer to a grouping of related services, functions, or resources, which may be distributed or aggregated. Furthermore, the use of memory, database, information base, data store, tables, hardware, and the like may be used herein to refer to system component or components into which information may be entered or otherwise recorded.

The terms “include,” “including,” “comprise,” and “comprising” shall be understood to be open terms and any lists that follow are examples and not meant to be limited to the listed items. Any headings used herein are for organizational purposes only and shall not be used to limit the scope of the description or the claims.

Furthermore, it shall be noted that: (1) certain steps may optionally be performed; (2) steps may not be limited to the specific order set forth herein; (3) certain steps may be performed in different orders; and (4) certain steps may be done concurrently.

The present invention relates in various embodiments to devices, systems, methods, and instructions stored on one or more non-transitory computer-readable media involving full text search indexing. Such devices, systems, methods, and instructions stored on one or more non-transitory computer-readable media can result in, among other advantages, full text search indexing.

In a full text search index system, a buffer can be used to store words along with identification. The words can be terms, documents, files, names or any other item capable of being stored and searched. For the purpose of this document, the above terms are used interchangeably.

In an indexing system, words and an identification path can be stored in an index that can be searched. As described above, one disadvantage of the prior art indexing systems is that the indexes cannot scale easily and require adding more hard disk drives and more computers along with the hard disk drives. Embodiments described herein overcome those limitations by using object storage.

FIG. 2 depicts a block diagram of a merge operation 200 in an indexing system according to embodiments of the present disclosure. FIG. 2 shows two buffer indexes 205 and 215, both already in sorted order. In the embodiment shown in FIG. 2, the two indexes are merged into one index 210. Index 210 also is sorted. For example, in FIG. 2, index 205 contains A, P, and Y and document identification. Index 215 contains B, L, and M and document identification. When the two indexes are merged, index 210 contains A, B, L, M, P, and Y and document identification. The contents are in sorted order when merged in index 210.

In the system shown in FIG. 2, buffers can be used to build indexes. For example, a buffer can be used to store A, P, and Y. Once the buffer is full, it can be saved to a hard disk and a new buffer started. In the example shown in FIG. 2, the new buffer contains B, L, and M. Once that buffer is full, it can also be saved to a hard disk. The two indexes can be merged as shown in FIG. 2 on the hard disk.
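
By way of illustration only, the merge of two sorted buffers described with reference to FIG. 2 can be sketched in Python as follows. The function name merge_sorted_indexes and the document identifiers are hypothetical and chosen for this example; the sketch relies on the standard library function heapq.merge and is not asserted to be the implementation of any embodiment.

```python
import heapq

def merge_sorted_indexes(*indexes):
    """Merge already-sorted lists of (word, doc_id) pairs into one sorted list."""
    # heapq.merge assumes every input is already sorted and yields items in order.
    return list(heapq.merge(*indexes))

# Hypothetical contents of the two buffer indexes; the doc ids are made up.
index_205 = [("A", 1), ("P", 2), ("Y", 3)]
index_215 = [("B", 4), ("L", 5), ("M", 6)]

index_210 = merge_sorted_indexes(index_205, index_215)
merged_words = [word for word, _ in index_210]
```

Because both inputs are already sorted, the merge reads each entry once and never needs to hold more than the heads of the inputs in memory at a time.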

FIG. 3 depicts a block diagram 300 of a B+ tree structure indexing system according to embodiments of the present disclosure. FIG. 3 shows another representation of an indexing system in a B+ tree structure. The B+ tree structure includes a plurality of nodes. Node 302 is a root node in the tree structure. The root node indicates a lookup, similar to a dictionary, where words can be found in the tree. For example, in the embodiment shown in FIG. 3, the root node contains an indication that words less than apple 305 can be found in internal node 322, words between apple and cherry can be found in internal node 332, and words between cherry and orange can be found in internal node 352. The tree structure can continue with leaf nodes 362 and 382. The example shown in FIG. 3 is merely exemplary and not intended to limit the embodiments to any particular number of nodes or number of layers on the tree.
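
As a non-limiting sketch, the dictionary-like lookup performed at the root node can be expressed in Python with a binary search over the separator words. The names SEPARATORS, CHILDREN, and route are hypothetical, as are two assumptions made for the example: a word equal to a separator is routed to the node on its right, and words at or beyond the last separator are reported as not covered, since the description does not say where they reside.

```python
import bisect

# Separator words held by the root node, in sorted order.
SEPARATORS = ["apple", "cherry", "orange"]
# Child nodes: words < apple, apple..cherry, cherry..orange (labels are illustrative).
CHILDREN = ["node_322", "node_332", "node_352"]

def route(word):
    """Return the child node that may contain `word`, or None if no child covers it."""
    # bisect_right counts how many separators are less than or equal to the word.
    slot = bisect.bisect_right(SEPARATORS, word)
    return CHILDREN[slot] if slot < len(CHILDREN) else None
```

For example, route("banana") selects the node covering words between apple and cherry, without examining any other node.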

As the tree grows, more and more indexes can be built and more machines and hard disk drives can be used. However, this tree structure has the drawback of expensive scalability due to the use of hard disk drives requiring machines to remain on so the search can be performed.

FIG. 4 depicts a block diagram of an indexing system 400 according to embodiments of the present invention. FIG. 4 shows indexes 410, 415, and 420 merged into index 425. Further, other indexes can be merged into indexes 430 and 435. As the index grows, each index can be started on a new machine. Indexes 425, 430, and 435 can all be stored on different hard disk drives on different machines. Since the index can grow large enough to be stored on a plurality of machines, all machines need to be searched in order to search the index. Therefore, the embodiment shown in FIG. 4 also shows a map 405 to map the various indexes 425, 430, and 435. It shall be understood that the number of indexes and mappings is not limited to the numbers shown in FIG. 4.

FIG. 5 depicts a block diagram of an indexing system 500 using object storage and a distributed database according to embodiments of the present invention. In the embodiments shown in FIG. 5, object storage is implemented to store data merged from the indexes described in reference to FIGS. 2-4. Object storage is a computer storage architecture that stores data as objects. Indexes can be stored as objects 530, 535, and 540 as shown in FIG. 5. In embodiments, objects can be stored in object storage. Objects can also have object keys. In embodiments, an object key is a name assigned to an object used to retrieve the object.

Indexes 515, 520, and 525 can be merged into index 530. It shall be understood that the embodiments described are not limited to a particular number of indexes or objects. In embodiments, objects 515, 520, and 525 can be stored in object storage 505. Object storage provides significant advantages over conventional hard disk drive storage. For example, object storage can be used to build the index and can be large, but does not have to be maintained on a hard disk drive with a machine that has to be kept on. Also, the cost of object storage is considerably less than the cost of hard disk drive storage. A cost savings of a factor of 10 can be enjoyed by using object storage over conventional hard disk drive storage.

In embodiments, a distributed database 510 can be used to map to the various objects 530, 535, and 540 stored in object storage 505. Distributed database index 510 can be updated as merges occur, such as the merge of blocks 515, 520, and 525 into object 530. Pointers can be used from the distributed index 510 to the object storage 505 to indicate in which object in object storage 505 to perform the search. Pointers are shown in FIG. 5 as 560, 565, and 570.
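
The relationship between the distributed database index and the objects it points to can be sketched as follows, purely as an illustration. In-memory Python structures stand in for both the object storage and the distributed database; the names object_storage, database_index, store_object, and search, the object keys, and the choice to key each database entry by the last word in its object are all assumptions made for this example. Each stored object is assumed to be non-empty and sorted.

```python
import bisect

object_storage = {}   # object key -> sorted list of (word, doc_id) pairs
database_index = []   # sorted list of (last_word_in_object, object_key) pairs

def store_object(object_key, sorted_entries):
    """Write an object once and record a pointer to it in the database index."""
    object_storage[object_key] = list(sorted_entries)
    last_word = sorted_entries[-1][0]
    bisect.insort(database_index, (last_word, object_key))

def search(word):
    """Follow the pointer for `word` and return the matching document ids."""
    # (word,) sorts before any (word, object_key), so this finds the first
    # entry whose last word is greater than or equal to the searched word.
    slot = bisect.bisect_left(database_index, (word,))
    if slot == len(database_index):
        return []  # the word is larger than every indexed word
    _, object_key = database_index[slot]
    return [doc_id for w, doc_id in object_storage[object_key] if w == word]

# Hypothetical objects and document ids.
store_object("object_530", [("ant", 1), ("apple", 2)])
store_object("object_535", [("banana", 3), ("cherry", 4)])
store_object("object_540", [("grape", 5), ("orange", 6)])
```

In this sketch only one object is read per search, which reflects the purpose of the pointers: the small index is consulted first so that the large objects need not all be scanned.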

In one example of the embodiments shown in FIG. 5, one million words can be stored in each object 530. In an example of the embodiments shown in FIG. 5, 1000 objects can be stored. Thus, in this example, one billion terms can be stored.

As the size of the index grows, the index may no longer fit in a single object. In embodiments, the object is split among multiple objects. In embodiments, a plurality of object keys in the index can be sorted and the first set of keys can be stored in a first object, the second set in a second object, etc. Thus, an object is split into a plurality of objects.
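
One possible way to split a sorted index into consecutive objects is sketched below for illustration. The function name split_into_objects, the max_entries limit, and the object key format built from a prefix and a zero-padded position are hypothetical choices for this example and are not taken from the embodiments.

```python
def split_into_objects(sorted_entries, max_entries, prefix="index-1/level-0"):
    """Split sorted (word, doc_id) entries into consecutive objects.

    Returns a list of (object_key, entries) pairs in index order.
    """
    if max_entries < 1:
        raise ValueError("max_entries must be at least 1")
    objects = []
    for start in range(0, len(sorted_entries), max_entries):
        position = start // max_entries          # position of the object in the index
        object_key = f"{prefix}/{position:06d}"  # illustrative key format
        objects.append((object_key, sorted_entries[start:start + max_entries]))
    return objects

# Hypothetical entries; a real object might hold on the order of a million words.
entries = sorted([("cherry", 3), ("apple", 1), ("orange", 5), ("banana", 2), ("grape", 4)])
objects = split_into_objects(entries, max_entries=2)
```

Because the entries are sorted before splitting, every word in an earlier object precedes every word in a later object, which is what allows a search to pick a single object.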

As more terms are added or more indexes are merged, more storage space can be required. An index that requires more space can be stored in multiple objects so that individual object size does not grow too big. The objects can be ordered such that high order objects store higher order terms. This ordering can be stored using the distributed database 510. In the embodiment shown in FIG. 5, the object order can be stored in the distributed database 510, whereas leaf nodes can be stored in the object storage 505.

The embodiments shown in FIG. 5 take advantage of the fact that object storage is designed to write once and read multiple times. Since object storage is designed to be written to once, the distributed database 510 can serve as the root and internal nodes 545, 550, and 555, including words apple, cherry, and orange, similar to the root nodes shown in FIG. 3. Distributed database 510 can be updated as more objects are added to the object storage.

In one embodiment, the order of objects that is stored in the distributed database can be moved to an object to avoid accessing the distributed database during search. For searching, the object that stores the order is loaded first to identify the leaf node that contains the term being searched. FIG. 6A depicts a block diagram of an indexing system 600 using object storage and a distributed database using encryption according to embodiments of the present invention. FIG. 6A shows object storage 605 and distributed database 610. In the embodiments shown in FIG. 6A, root nodes 640, 645, and 650 can be combined 655 and encrypted and stored in object storage 605. In one embodiment, 50 nodes are combined and encrypted and moved to object storage as 620. In another embodiment, any number of nodes in the distributed database 610 can be combined, encrypted, and stored in object storage 605. In one embodiment, 1000 nodes can be combined and stored in object storage 605. A tree structure can also be built in object storage, shown in FIG. 6A as root node 620 and leaf nodes 625, 630, and 635. Any number of nodes can be used in this index and any number of internal nodes can be used in the tree in object storage.

In embodiments, all the objects are encrypted and an initialization vector (IV) for the encryption is a combination of index id and the position of the object in the index. In embodiments, another level of index can be used as shown in FIG. 6A. The level 1 index 625, 630, or 635 can use the last key indexed by level 0 object 620 as the key and the position of the level 0 object 620 as the value.
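
A minimal sketch of combining an index id and an object position into an IV is given below. The description does not state how the two values are combined or how long the IV is; packing two unsigned 64-bit big-endian integers into 16 bytes (the block size of a cipher such as AES) is an assumption made for this example, and the function name make_iv is hypothetical. No encryption is performed in the sketch.

```python
import struct

def make_iv(index_id: int, position: int) -> bytes:
    """Build a 16-byte IV from an index id and the object's position in the index."""
    # ">QQ" packs two unsigned 64-bit integers in big-endian order with no padding.
    return struct.pack(">QQ", index_id, position)

iv_first = make_iv(7, 0)    # first object of hypothetical index 7
iv_second = make_iv(7, 1)   # second object of the same index
```

Under this layout, two objects receive the same IV only if they share both the index id and the position, so distinct objects within an index are encrypted under distinct IVs.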

FIG. 6B depicts a block diagram of an indexing system using object storage and a distributed database according to embodiments of the present invention. FIG. 6B shows an object storage 675 and a distributed database 676. FIG. 6B shows a block diagram having a level 1 index 677 and a level 0 index 678, 679, and 680.

During search, level 1 index 677 is searched first to locate, from the level 0 objects (apple, cherry, orange) 678, 679, 680, the level 0 object that contains the key being searched. Once the level 0 object is identified, it can be loaded and the given key searched within that object. If the level 1 index grows too big to fit in a single object, it can be split and a level 2 index (not shown) can be built to improve search efficiency, and so on.
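
The multi-level arrangement can be illustrated with the following sketch, which builds higher levels until the top level fits in a single object and then searches from the top level down. The function names chunk, build_levels, and search, the capacity parameter, and the in-memory lists standing in for stored objects are hypothetical. Following the description, each higher-level entry pairs the last key of a lower-level object with that object's position; words are assumed to be unique and the input sorted.

```python
import bisect

def chunk(entries, capacity):
    """Split a list into consecutive pieces of at most `capacity` entries."""
    return [entries[i:i + capacity] for i in range(0, len(entries), capacity)]

def build_levels(sorted_entries, capacity):
    """Return a list of levels; levels[0] holds (word, doc_id) objects and each
    higher level holds (last_key, position) entries for the level below."""
    if capacity < 2:
        raise ValueError("capacity must be at least 2 so that each level shrinks")
    levels = [chunk(sorted_entries, capacity)]
    while len(levels[-1]) > 1:
        below = levels[-1]
        entries = [(obj[-1][0], position) for position, obj in enumerate(below)]
        levels.append(chunk(entries, capacity))
    return levels

def search(levels, word):
    """Walk from the top level to level 0 and return matching doc ids, or None."""
    if not levels[0]:
        return None
    position = 0  # the top level always consists of a single object
    for level in range(len(levels) - 1, 0, -1):
        obj = levels[level][position]
        keys = [key for key, _ in obj]
        slot = bisect.bisect_left(keys, word)  # first last_key >= word
        if slot == len(keys):
            return None  # word is larger than every indexed key
        position = obj[slot][1]
    matches = [doc_id for w, doc_id in levels[0][position] if w == word]
    return matches or None

# Hypothetical words and doc ids; capacity 2 forces three levels for seven words.
words = ["apple", "banana", "cherry", "grape", "orange", "pear", "plum"]
levels = build_levels([(w, n + 1) for n, w in enumerate(words)], capacity=2)
```

In this sketch a search loads one object per level, so the number of objects read grows with the number of levels rather than with the number of level 0 objects.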

The higher-level objects can also be encrypted, and the IV is composed of the index id, the level, and the position of the object in the higher-level index.

During the merge process, as level 0 objects are added, entries in a level 1 object can be made and stored (instead of adding distributed database entries to record the order of level 0 objects). There are two challenges with this system. One, objects can only use eventual consistency when it comes to updates to the same object. Two, the same IV cannot be used to encrypt multiple data blocks because that increases the chances of an intruder cracking the encryption key. To overcome the first challenge, every time a higher-level object is stored a different path is used so that it gets treated as a new object write rather than an update.

To overcome the second challenge, a modification counter can be added to the IV. When an object is stored for the first time, the modification counter is 1. When another key is added and the object is stored a second time, the modification counter is incremented, and so on for the subsequent keys added. The final modification sets the counter to 0. In embodiments, the same modification counter is used to build a new object path for each change. In embodiments, the final object is found at a predictable path with a predictable IV because the modification counter is set to 0.
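
The use of the modification counter in both the object path and the IV can be sketched as follows. The class name HigherLevelObject, the path format, and the 16-byte IV layout (4-byte index id, 2-byte level, 2-byte counter, 8-byte position) are assumptions made for this example; the description states only which values are combined, not how. The encryption and the upload themselves are omitted.

```python
import struct

def make_iv(index_id, level, position, counter):
    """Pack index id, level, modification counter, and position into 16 bytes."""
    # ">IHHQ" = 4 + 2 + 2 + 8 bytes, big-endian, no padding (illustrative layout;
    # the counter must fit in 16 bits under this layout).
    return struct.pack(">IHHQ", index_id, level, counter, position)

def make_path(index_id, level, position, counter):
    """Build an object path that changes whenever the counter changes."""
    return f"index-{index_id}/level-{level}/pos-{position}/mod-{counter}"

class HigherLevelObject:
    """Tracks the entries of one higher-level object and where each version goes."""

    def __init__(self, index_id, level, position):
        self.index_id, self.level, self.position = index_id, level, position
        self.entries = []
        self.counter = 0  # becomes 1 on the first store

    def _target(self):
        args = (self.index_id, self.level, self.position, self.counter)
        return make_path(*args), make_iv(*args)

    def add_entry(self, last_key, lower_position):
        """Add an entry and return the (path, iv) for this intermediate version."""
        self.entries.append((last_key, lower_position))
        self.counter += 1
        return self._target()

    def finalize(self):
        """Return the (path, iv) for the final version, which uses counter 0."""
        self.counter = 0
        return self._target()

obj = HigherLevelObject(index_id=7, level=1, position=0)
first = obj.add_entry("banana", 0)
second = obj.add_entry("grape", 1)
final = obj.finalize()
```

Each intermediate version in the sketch gets its own path and its own IV, so no stored object is overwritten and no IV is reused, while a reader can compute the final path and IV without knowing how many modifications occurred.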

During the index build process, a single distributed database 676 entry to mark the top-level object (at that point in the index build process) can be used, e.g., 676. The database entry 676 can store the level and modification counter for the top-level object.

FIG. 7 depicts a flowchart depicting a process 700 of creating an index according to embodiments of the present invention. FIG. 7 shows storing blocks in object storage 705. The blocks can be stored as objects with an object key. The blocks can be merged as indexes to store in the object in object storage 710. FIG. 7 also shows creating a distributed database 715. A pointer can be used to map from an object in the distributed database to an object in the object storage 720. In embodiments, encryption can be used to encrypt the object storage 725. In embodiments, encryption can also be used to encrypt the distributed database as a combination of x objects and moving the encrypted combination to object storage 730. A tree structure can be built with the objects in object storage 735.

FIG. 8 depicts a flowchart depicting a process 800 of searching an index according to embodiments of the present invention. FIG. 8 shows searching the distributed database 805 as described above with reference to FIG. 6A. FIG. 8 also shows, responsive to the pointer being found, using the pointer to point to the object storage 810. The object in object storage is then searched, based on the pointer information 815. FIG. 8 also shows returning a search result responsive to the search query being satisfied 820.

In embodiments, a modification counter can be used as described with reference to FIG. 6A.

FIG. 9 depicts a block diagram of a computer system 900 according to embodiments of the present invention. It will be understood that the functionalities shown for system 900 may operate to support various embodiments of an information handling system, although it shall be understood that an information handling system may be differently configured and include different components. As illustrated in FIG. 9, system 900 includes a central processing unit (CPU) 901 that provides computing resources and controls the computer. CPU 901 may be implemented with a microprocessor or the like, and may also include a graphics processor and/or a floating-point coprocessor for mathematical computations. System 900 may also include a system memory 902, which may be in the form of random-access memory (RAM) and read-only memory (ROM).

A number of controllers and peripheral devices may also be provided, as shown in FIG. 9. An input controller 903 represents an interface to various input device(s) 904, such as a keyboard, mouse, or stylus. There may also be a scanner controller 905, which communicates with a scanner 906. System 900 may also include a storage controller 907 for interfacing with one or more storage devices 908, each of which includes a storage medium such as magnetic tape or disk, or an optical medium that might be used to record programs of instructions for operating systems, utilities and applications which may include embodiments of programs that implement various aspects of the present invention. Storage device(s) 908 may also be used to store processed data or data to be processed in accordance with the invention. System 900 may also include a display controller 909 for providing an interface to a display device 911, which may be a cathode ray tube (CRT), a thin film transistor (TFT) display, or other type of display. The computing system 900 may also include a printer controller 912 for communicating with a printer 913. A communications controller 914 may interface with one or more communication devices 915, which enables system 900 to connect to remote devices through any of a variety of networks including the Internet, an Ethernet cloud, an FCoE/DCB cloud, a local area network (LAN), a wide area network (WAN), a storage area network (SAN) or through any suitable electromagnetic carrier signals including infrared signals.

In the illustrated system, all major system components may connect to a bus 916, which may represent more than one physical bus. However, various system components may or may not be in physical proximity to one another. For example, input data and/or output data may be remotely transmitted from one physical location to another. In addition, programs that implement various aspects of this invention may be accessed from a remote location (e.g., a server) over a network. Such data and/or programs may be conveyed through any of a variety of machine-readable media including, but not limited to: magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and holographic devices; magneto-optical media; and hardware devices that are specially configured to store or to store and execute program code, such as application specific integrated circuits (ASICs), programmable logic devices (PLDs), flash memory devices, and ROM and RAM devices.

Embodiments of the present invention may be encoded upon one or more non-transitory computer-readable media with instructions for one or more processors or processing units to cause steps to be performed. It shall be noted that the one or more non-transitory computer-readable media shall include volatile and non-volatile memory. It shall be noted that alternative implementations are possible, including a hardware implementation or a software/hardware implementation. Hardware-implemented functions may be realized using ASIC(s), programmable arrays, digital signal processing circuitry, or the like. Accordingly, the “means” terms in any claims are intended to cover both software and hardware implementations. Similarly, the term “computer-readable medium or media” as used herein includes software and/or hardware having a program of instructions embodied thereon, or a combination thereof. With these implementation alternatives in mind, it is to be understood that the figures and accompanying description provide the functional information one skilled in the art would require to write program code (i.e., software) and/or to fabricate circuits (i.e., hardware) to perform the processing required.

One skilled in the art will recognize that no computing system or programming language is critical to the practice of the present invention. One skilled in the art will also recognize that a number of the elements described above may be physically and/or functionally separated into sub-modules or combined together.

It will be appreciated by those skilled in the art that the preceding examples and embodiments are exemplary and not limiting to the scope of the present invention. It is intended that all permutations, enhancements, equivalents, combinations, and improvements thereto that are apparent to those skilled in the art upon a reading of the specification and a study of the drawings are included within the true spirit and scope of the present invention.

What is claimed is:
 1. A full text search index system for executing full text searching, the system comprising: an object storage configured to store an object containing a plurality of words and a plurality of identifiers, each word having an associated identifier; a writable distributed database index to the object storage; and a pointer pointing from the distributed database index to the object in object storage.
 2. The system of claim 1, wherein the distributed database comprises a plurality of objects and each object has a pointer associated with it pointing to a corresponding object in the object storage.
 3. The system of claim 1 further comprising an encryption tool configured to encrypt a plurality of blocks within the distributed database index and move the encrypted plurality of blocks to the object storage.
 4. The system of claim 1, wherein the writable distributed database index is updated as data is added to the object storage.
 5. The system of claim 1 further comprising a modification counter implemented to count the number of times an object is stored and the count is used as a part of an initialization vector (IV) when encrypting the object.
 6. The system of claim 1, wherein the object in object storage is encrypted.
 7. The system of claim 1, wherein the object in object storage is a merged index.
 8. A method of building a full text search index, the method comprising: storing an object containing a plurality of words and a plurality of identifiers, each word having a corresponding identifier, in object storage; and creating a distributed database index with a pointer to a particular object in object storage.
 9. The method of claim 8 further comprising encrypting a plurality of objects in the distributed database index and moving the encrypted plurality of objects to object storage.
 10. The method of claim 8 further comprising merging a plurality of indexes into the object in object storage.
 11. The method of claim 8, wherein the distributed database index is updated as data is added to the object storage.
 12. The method of claim 8 further comprising counting the number of times an object is stored using a modification counter.
 13. The method of claim 8, wherein the object in object storage is encrypted.
 14. The method of claim 8, wherein the object in object storage is a merged index.
 15. A method of searching an index, the method comprising: searching an object in a distributed database, responsive to receiving a search query; following a pointer to an object in object storage, responsive to locating the search term in the distributed database; searching the object in object storage; and returning a search result, responsive to satisfying the search query.
 16. The method of claim 15 further comprising decrypting data stored in the distributed database.
 17. The method of claim 15 further comprising decrypting data stored in the object storage.
 18. The method of claim 15, wherein a full text search of the index is capable of being performed.
 19. The method of claim 15, wherein the block in object storage is encrypted.
 20. The method of claim 15, wherein the block in object storage is a merged index.